Abstract

We study the problem of learning $k$-juntas given access to examples drawn from a number of different
product distributions. Thus we wish to learn a function $f : \{-1,1\}^n \to \{-1,1\}$ that depends
on $k$ (unknown) coordinates. While the best known algorithms for the general problem of learning a
$k$-junta require running time of $n^k\,\mathrm{poly}(n, 2^k)$, we show that given access to $k$ different product
distributions with biases separated by $\gamma > 0$, the functions may be learned in time $\mathrm{poly}(n, 2^k, \gamma^{-k})$. More
generally, given access to $t \le k$ different product distributions, the functions may be learned in time
$n^{k/t}\,\mathrm{poly}(n, 2^k, \gamma^{-k})$. Our techniques involve novel results in Fourier analysis relating Fourier
expansions with respect to different biases and a generalization of Russo's formula.

Keywords: learning juntas, PAC learning, biased product distributions, Fourier analysis of Boolean
functions, Russo's formula

1 Introduction 

1.1 Motivation 

A $k$-junta is a function $f : \{-1,1\}^n \to \{-1,1\}$ that only depends on a subset of $k$ variables $x_{i_1}, \ldots, x_{i_k}$.
Blum and Langley [6] proposed the problem of learning the class of $k$-juntas, which we refer to as the junta
learning problem, as a clean and appealing model of learning in the presence of much irrelevant information.
It is considered to be among the most important problems in computational learning theory to date [4,
23]. In addition to being an interesting class in itself, the importance of learning juntas is supported by its
connections to learning decision trees and DNFs, see [23]. Mossel, O'Donnell, and Servedio [23] observed
that junta learning is efficiently solvable in the membership query model and in the random walk model,
whereas it is provably hard in the statistical query model. What lies in between is the uniform distribution
PAC model, for which [23] presented an algorithm with running time roughly $n^{0.7k}$, currently the
best improvement upon a straightforward algorithm that runs in roughly $n^k$ steps. For general distributions,
no such improvement is known. The little progress on the junta learning problem in the PAC model to date
might be considered evidence of the hardness of the problem in this model. At the same time, however, no
lower bounds are available either.

Apart from devising fast learning algorithms, another goal is often to have low sample complexity (i.e.,
a small number of examples needed to learn). Information-theoretically, $\Theta(k \log n + 2^k)$ examples are
necessary and sufficient for learning $k$-juntas on $n$ bits ([7, 26, 1]). The algorithm of [23], however, needs
to draw roughly $n^{0.3k}$ examples in the worst case.

It thus seems reasonable to ask if we can find a natural extension of the PAC learning model under fixed
distributions that admits junta learning algorithms that run in time $t(k) \cdot \mathrm{poly}(n)$ for some function $t$ that is
independent of $n$ and some polynomial that is independent of $k$. Moreover, such algorithms should ideally
use $s(k) \cdot O(\log n)$ examples for some function $s$ independent of $n$.

In this paper, we propose such a model: instead of giving the learner access to only one oracle, we study 
the setting in which a learner has access to multiple oracles that generate examples according to different
distributions. Although in this paper, we are mainly interested in learning from product distributions, we
introduce the model in more generality since we believe that studying the learnability of other classes in 
this model, possibly under less restricted distributions, is a worthwhile goal for future research. In data 
mining and applied machine learning, researchers often depart from the assumption of having access to 
only one source of data in order to capture more realistic scenarios such as having multiple sources of 
different quality [10, 11], receiving partial information about tuples of examples [12], or observing sets of 
different attributes for the same examples [21]. We mention three possible real-world learning scenarios
in which our model can be applied. First, the examples could be obtained as series of measurements in
certain experimental setups, so that different oracles correspond to different setups, resulting in different
distributions over the instance space. Second, examples could be sampled from disjoint populations in which
the distributions of attributes differ significantly. A third application comes to mind when considering
data generated by a mixture of distributions. After applying algorithms to tell the distributions apart (say,
from unlabeled examples) [19, 28, 13], one could use algorithms designed for the model of learning from
multiple distributions to finally learn the concept under consideration.

For our results on the junta learning problem, we consider $r$-biased oracles that generate examples
$(\mathbf{x}, f(\mathbf{x}))$ according to $r$-biased product distributions $\mu_r$ on $\{-1,1\}^n$ for biases $r \in (-1,1)$. These are
distributions such that every variable $x_i$ independently takes on values $-1$ and $+1$ with probability $(1 - r)/2$
and $(1 + r)/2$, respectively (so that $\mathrm{E}_{\mu_r}[x_i] = r$).
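
The $r$-biased oracle described above can be simulated in a few lines. This is a minimal sketch for illustration only: the names `biased_example`, `biased_oracle`, and `target`, as well as the choice of target function, are assumptions and do not appear in the paper.

```python
import random

def biased_example(f, n, r, rng):
    """Draw one labeled example (x, f(x)) with x from the r-biased product distribution."""
    # Each coordinate is +1 with probability (1 + r) / 2 and -1 otherwise, so E[x_i] = r.
    x = tuple(1 if rng.random() < (1 + r) / 2 else -1 for _ in range(n))
    return x, f(x)

def biased_oracle(f, n, r, seed=0):
    """Return a zero-argument callable that plays the role of an r-biased oracle."""
    rng = random.Random(seed)
    return lambda: biased_example(f, n, r, rng)

# Hypothetical target: a 2-junta on n = 5 bits that depends on coordinates 0 and 3.
def target(x):
    return x[0] * x[3]

oracle = biased_oracle(target, n=5, r=0.5)
samples = [oracle() for _ in range(20000)]
# The empirical mean of any single coordinate should be close to the bias r.
empirical_bias = sum(x[0] for x, _ in samples) / len(samples)
```

A learner in the multiple-oracle setting would simply hold one such callable per bias.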

As in the setting with one uniform distribution oracle [23] (this is the case $r = 0$), we show that the
junta learning problem from multiple oracles reduces to the task of identifying at least one relevant variable.
In general, a conceptual method to identify relevant variables is to find non-vanishing Fourier coefficients
$\hat f(S, r)$, $S \subseteq [n]$, where $r$ denotes the bias of the underlying distribution. The Fourier coefficient $\hat f(S, r)$
measures the correlation between the function value $f(\mathbf{x})$ and the function $\chi_S(\mathbf{x}, r) = \prod_{i \in S} (x_i - r)/\sigma$
with $\sigma = \sqrt{1 - r^2}$ (see Section 2.3 for details). Most of the literature focuses on the case $r = 0$, in which
$\chi_S(\mathbf{x}, 0)$ reduces to the parity of the variables indexed by $S$, and $\hat f(S, 0)$ is commonly denoted by $\hat f(S)$.
The point is that whenever $\hat f(S, r) \ne 0$, all variables $x_i$ with $i \in S$ are relevant. If we pursue the search by
starting with singletons $S$ and then move on to higher levels, this method takes time about $n^s$ if
$\hat f(S, r) = 0$ for all $S$ of size up to $s$. The question is how to proceed if such a situation occurs for some
$s \in \omega(1)$. In [23] it is proposed to use a second approach based on the calculation of the coefficients of the
polynomial representation of $f$ over the two-element field, and a trade-off between $s$ and the degree of this
polynomial is shown. In a different direction, Atici and Servedio [3] enhance the uniform PAC model by a
quantum subroutine to circumvent exhaustive search for non-zero Fourier coefficients. Our solution, giving
the learner access to several (classical) oracles, can be considered as another (and maybe more realistic)
alternative.

From a conceptual viewpoint, our main result shows that the junta learning problem is efficiently solv- 
able in a passive learning model (as opposed to allowing the learner to actively ask membership queries) 
with independent random examples (as opposed to learning from, say, random walks, where examples are 
highly correlated). 

1.2 Our Results 

We solve the problem of vanishing Fourier coefficients up to high levels by considering Fourier coefficients
with respect to multiple distributions: we show that if all Fourier coefficients $\hat f(S, r_i)$ of a $k$-junta $f$ vanish
up to level $s$ with respect to $t$ different biases $r_1, \ldots, r_t$, then $s \cdot t < \deg(f)$, where

$$\deg(f) = \max\{|S| : \hat f(S) \ne 0\} \le k$$

is the degree of $f$. Specifically, we prove

Theorem 1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a non-constant function and $s, t \in \mathbb{N}$ be such that $s \cdot t \ge \deg(f)$.
Let $r_1, \ldots, r_t \in (-1,1)$ be arbitrary pairwise different biases. Then there exist an $i \in [t]$ and a set $S \subseteq [n]$
with $1 \le |S| \le s$ such that $\hat f(S, r_i) \ne 0$.
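
As a small sanity check of this statement (our own illustration, not taken from the paper), consider the parity $f(\mathbf{x}) = x_1 x_2$, which has degree 2. Using the basis functions $\chi_i(\mathbf{x}, r) = (x_i - r)/\sigma$ with $\sigma = \sqrt{1 - r^2}$ from Section 2.3 and the independence of the coordinates, one can compute:

```latex
\begin{align*}
\mathrm{E}_r[f] &= \mathrm{E}_r[x_1]\,\mathrm{E}_r[x_2] = r^2, \\
\hat f(\{1\}, r) &= \mathrm{E}_r\!\left[x_1 x_2 \cdot \frac{x_1 - r}{\sigma}\right]
  = \frac{\mathrm{E}_r[x_1^2 x_2] - r\,\mathrm{E}_r[x_1 x_2]}{\sigma}
  = \frac{r - r^3}{\sigma} = \sigma r .
\end{align*}
```

By symmetry, $\hat f(\{2\}, r) = \sigma r$ as well. Both first-level coefficients vanish only at $r = 0$, so with $s = 1$ any $t = 2$ distinct biases include at least one bias at which a singleton coefficient is nonzero.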

Letting $s = 1$ and $t = k$, Theorem 1 implies that there are at most $k - 1$ different biases $r$ such that all
$r$-biased first-level Fourier coefficients of $f$ vanish. As a consequence, whenever a learner has access to $k$
$r$-biased oracles for $k$ pairwise distinct biases $r$, it suffices to consider, for each given bias $r$, only the coefficients
$\hat f(S, r)$ at all singletons $S$ in order to find at least one relevant variable. The main technical issue we have
to take care of is that Theorem 1 does not rule out the possibility that $|\hat f(S, r_i)|$ could be extremely small,
so that it would require a large number of examples to tell whether a coefficient is nonzero. To take this
into account, we add the requirement that the biases are well separated, i.e., have pairwise distance at least
$\gamma > 0$. In addition, we allow the running time to also depend on the inverse of the minimum distance of the
biases to $-1$ or $1$ since the degenerate cases $r = -1$ or $r = 1$ only produce the single example $(\bar r, f(\bar r))$,
from which we cannot learn anything. Here $\bar r$ denotes the vector with all $n$ entries equal to $r$. Our main
learning theory application of Theorem 1 (in the special case $s = 1$ and $t = k$) is:

Theorem 2. Let $-1 + \alpha \le r_1 < \ldots < r_k \le 1 - \alpha$ for some $\alpha > 0$ such that for all $i \in [k - 1]$,
$r_{i+1} - r_i \ge \gamma > 0$. Then the class of $k$-juntas is exactly learnable with access to $r_i$-biased oracles, $i \in [k]$,
from $m = \mathrm{poly}(\log n, 2^k, (1/\gamma)^k, 1/\alpha, \log(1/\delta))$ examples in time $\mathrm{poly}(m, n)$.

Theorem 2 immediately follows from the following generalization, which is based on the general case
($s \cdot t \ge k$) in Theorem 1. The trade-off between the number of $r$-biased oracles to which a learner has access
and the level up to which the learner has to inspect the Fourier coefficients results in a trade-off between the
number of oracles and the running time:

Theorem 3. Let $k, s, t \in \mathbb{N}$ such that $s \cdot t \ge k$ and $-1 + \alpha \le r_1 < \ldots < r_t \le 1 - \alpha$ for some $\alpha > 0$
such that for all $i \in [t - 1]$, $r_{i+1} - r_i \ge \gamma > 0$. Then the class of $k$-juntas is exactly learnable with access
to $r_i$-biased oracles, $i \in [t]$, using $m = \mathrm{poly}(\log n, 2^k, (1/\gamma)^k, (1/\alpha)^s, \log(1/\delta))$ examples and running in
time $n^s \cdot \mathrm{poly}(m, n)$.

In other words, given access to $t$ biased oracles with biases separated by $\gamma > 0$, the class of $k$-juntas
is learnable in time $n^{k/t}\,\mathrm{poly}(n, 2^k, \gamma^{-k})$. We should mention that we must have $\gamma \le 2/t$ to be able to
separate $t$ biases, so that $\gamma^{-k} \ge (t/2)^k$. If $t = k$, the running time is thus at least polynomial in $2^{k \log k}$.

Theorems 2 and 3 are valid even if the biases are not known to the learner in advance. This follows 
since given the promise that the examples are generated according to r-biased product distributions, the 
learner can efficiently approximate these biases to within high accuracy (even from unlabeled examples) 
and working with such approximate biases is sufficient to recognize non-vanishing Fourier coefficients of 
the true biases (see Section 6). 

It is observed in [23] that except for a set of measure zero of product distributions with bias vectors
$\mathbf{r} = (r_1, \ldots, r_n) \in [-1,1]^n$ (i.e., $\mathrm{E}[x_i] = r_i$), every $k$-junta $f$ has nonzero correlation with each of its
relevant variables. They concluded that for each such vector of biases, $k$-juntas are learnable with confidence
$1 - \delta$ in time $\mathrm{poly}(2^k, n, \log(1/\delta))$. However, the correlations may become arbitrarily small, so that in order
to identify nonzero correlations, these have to be approximated very precisely. As a consequence, the growth
of the poly expression heavily depends on the bias vector $\mathbf{r}$. More precisely, the running time depends on
$2^{c \cdot k}$, where the constant $c$ depends on the choice of $\mathbf{r}$.

When we restrict the product distributions to $r$-biased distributions, we can improve from a set of measure
zero of exceptional bias vectors to finitely many exceptional biases: for fixed $k$ and arbitrary $n$, there
are only finitely many critical biases $r \in (-1,1)$ such that there exists a $k$-junta $f$ with $\hat f(i, r) = 0$ for all
$i \in [n]$. As an application, we show

Theorem 4. Let $k \in \mathbb{N}$. Then for all but finitely many biases $r \in (-1,1)$, there exists a function $t_r : \mathbb{N} \to \mathbb{N}$
such that $k$-juntas are exactly learnable under the $r$-biased distribution in time $t_r(k) \cdot \mathrm{poly}(n, \log(1/\delta))$.

Note that, unlike this rather non-constructive result, our algorithm for the "multiple-oracles model" 
works for k arbitrary and unknown biased product distributions. 



1.3 Our methods 

Denote by $\mathrm{E}_r[f]$ the expected value of $f(\mathbf{x})$ under the $r$-biased distribution ($r \in (-1,1)$). Our main
technical tool is a formula that connects the higher-order derivatives of $\mathrm{E}_r[f]$ with respect to $r$ to the Fourier
weights at certain levels of the Fourier spectrum. The formula is close in spirit to Russo's well-known
formula for monotone functions and generalizations thereof to arbitrary bounded functions on the hypercube.
Russo's formula [24] states that for monotone Boolean functions $f : \{-1,1\}^n \to \{-1,1\}$,

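$$\frac{d}{dr}\,\mathrm{E}_r[f] = \sum_{i=1}^{n} \Pr_r\big[f(\mathbf{x}) \ne f(\mathbf{x}^{(i)})\big],$$

where $\mathbf{x}^{(i)}$ denotes $\mathbf{x}$ with the sign of the $i$-th entry flipped.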

More generally, the following connection between the derivative of the expectation (with respect to the bias)
and correlations between the function value and the variables is known (see Grimmett [15, Theorem 2.34]):

$$\frac{d}{dr}\,\mathrm{E}_r[f] = (1 - r^2)^{-1}\,\mathrm{Cov}_r\Big[f(\mathbf{x}), \sum_{i=1}^{n} x_i\Big] \qquad (1)$$

(here, we have translated Grimmett's notation to our setting, and Cov denotes the covariance). Since
$\hat f(i, r) = \mathrm{Cov}_r[f, (x_i - r)/\sigma] = \sigma^{-1}\,\mathrm{Cov}_r[f, x_i]$ with $\sigma = \sqrt{1 - r^2}$ (see Section 2.3), (1) can be rewritten as

$$\frac{d}{dr}\,\mathrm{E}_r[f] = (1 - r^2)^{-1/2} \sum_{i=1}^{n} \hat f(i, r). \qquad (2)$$

Define the weight $w_s(f, r)$ of the $s$-th $r$-biased Fourier level of $f$ as the sum of all $r$-biased Fourier
coefficients at level $s$, i.e.,

$$w_s(f, r) = \sum_{S \subseteq [n] : |S| = s} \hat f(S, r).$$

We use the following generalization of formula (2) which we attribute to folklore (and to the best of our 
knowledge, has not been published before). 

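Theorem 5 (Generalization of Russo's Formula). Let $f : \{-1,1\}^n \to \mathbb{R}$, $s \in [n]$, and $r^* \in (-1,1)$. Then

$$\frac{d^s}{dr^s}\,\mathrm{E}_r[f]\,\Big|_{r = r^*} = s! \cdot \big(1 - (r^*)^2\big)^{-s/2} \cdot w_s(f, r^*).$$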
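
The identity can be checked numerically on a small function. The sketch below assumes the reading $\frac{d^s}{dr^s}\mathrm{E}_r[f] = s!\,(1 - r^2)^{-s/2}\,w_s(f, r)$ of Theorem 5 and the fact that $\mathrm{E}_r[f] = \sum_i w_i(f, 0)\,r^i$; the helper names (`coeff`, `level_weight`, `derivative_of_expectation`) and the choice of the 3-bit majority function are ours, and all quantities are computed by brute force over $\{-1,1\}^n$.

```python
from itertools import combinations, product
from math import factorial, prod, sqrt

def coeff(f, n, S, r):
    """Exact r-biased Fourier coefficient of f at S (all biases equal to r)."""
    sigma = sqrt(1 - r * r)
    total = 0.0
    for x in product((-1, 1), repeat=n):
        weight = prod((1 + r * xi) / 2 for xi in x)  # mu_r(x)
        total += weight * f(x) * prod((x[i] - r) / sigma for i in S)
    return total

def level_weight(f, n, s, r):
    """w_s(f, r): the sum of all r-biased coefficients at level s."""
    return sum(coeff(f, n, S, r) for S in combinations(range(n), s))

def derivative_of_expectation(f, n, s, r):
    """s-th derivative of E_r[f] = sum_i w_i(f, 0) r^i, differentiated term by term."""
    return sum(
        level_weight(f, n, i, 0.0) * factorial(i) / factorial(i - s) * r ** (i - s)
        for i in range(s, n + 1)
    )

def maj3(x):
    """Majority of three bits, used here only as a test function."""
    return 1 if sum(x) > 0 else -1

r = 0.4
for s in (1, 2, 3):
    lhs = derivative_of_expectation(maj3, 3, s, r)
    rhs = factorial(s) * (1 - r * r) ** (-s / 2) * level_weight(maj3, 3, s, r)
    assert abs(lhs - rhs) < 1e-9
```

For the majority function, $\mathrm{E}_r[f] = (3r - r^3)/2$, so the first derivative at $r = 0.4$ is $1.5\,(1 - 0.16) = 1.26$, which is the value both sides take for $s = 1$.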



Theorem 5 follows from a similar statement for product distributions with arbitrary biases (see Proposition 1).
The second ingredient to prove Theorem 1 is the observation that we can write

$$\mathrm{E}_r[f] = \sum_{i=0}^{n} w_i(f, 0)\, r^i \qquad (3)$$

(see Section 2) and that this is a polynomial in $r$ of degree at most $\deg(f)$. Moreover, this polynomial is
constant (in $r$) if and only if $f$ is constant. From Theorem 5, we obtain that if for some $r^*$, the Fourier
coefficients $\hat f(S, r^*)$ vanish for all $S \subseteq [n]$ with $1 \le |S| \le s$, then $(d^i/dr^i)\,\mathrm{E}_r[f]\,|_{r = r^*} = 0$ for all $i \in [s]$,
i.e., $r^*$ is an $s$-fold root of the nonzero polynomial $(d/dr)\,\mathrm{E}_r[f]$, which is of degree at most $\deg(f) - 1$.
Since there can be at most $(\deg(f) - 1)/s$ roots of multiplicity $s$, this proves Theorem 1. To the best of our
knowledge, this is the first application of Theorem 5 in theoretical computer science. Let us remark further 
that we obtain the following relationship between Fourier weights with respect to different measures as a
consequence of Theorem 5 and Equation (3):

$$w_s(f, r) = (1 - r^2)^{s/2} \sum_{i=s}^{n} \binom{i}{s}\, w_i(f, 0)\, r^{i-s}. \qquad (4)$$


1.4 Related Work 

If we restrict ourselves to subclasses of $k$-juntas $f : \{-1,1\}^n \to \{-1,1\}$ such as monotone or symmetric
juntas (i.e., juntas invariant under permutations of the relevant variables), there do exist at least partially
satisfying solutions to the junta learning problem: under the uniform distribution, monotone $k$-juntas are
learnable in time $\mathrm{poly}(n, 2^k)$ from $\mathrm{poly}(\log n, 2^k)$ examples [23] and symmetric juntas are learnable in
time $n^{O(k/\log k)}\,\mathrm{poly}(n, 2^k)$ [20, 22]. Furthermore, results for learning other, more general classes under
fixed product distributions have been obtained [14, 16, 25, 9], including the polynomial time learnability of
monotone $O(\log^2 n / \log^2 \log n)$-juntas. Notably, parity juntas, i.e., parities of subsets of at most $k$
variables, are also efficiently learnable from product distributions (even in the presence of attribute and
classification noise), with the restriction that every variable has a non-zero bias [2].

Recently, Atici and Servedio [3] have studied the junta learning problem for the case that the learner
has access to a uniform distribution PAC oracle plus a quantum oracle. They showed that $k$-juntas are
learnable within accuracy $\varepsilon$ from $O(\varepsilon^{-1} k \log k)$ quantum examples and $O(2^k \log(1/\varepsilon))$ classical (uniformly
distributed) examples, both bounds being independent of $n$. Given this dramatic speed-up (which is
impossible to achieve from classical queries only), we ask the more realistic question of what can be done if
we are given access to multiple classical oracles.

Interestingly, our results are obtained in terms of purely statistical evaluation of the given data, i.e., one 
can interpret the Fourier algorithm as a statistical query (SQ) algorithm with respect to several distributions. 
While in the original SQ model [18], in which queries are evaluated with respect to the uniform distribution 
on the input space, (parity) juntas are provably not efficiently learnable [5, 8, 23], our results show that such 
a lower bound is not valid if queries are evaluated with respect to several distributions. 

1.5 Organization of this Paper 

We introduce all necessary prerequisites in Section 2. In Section 3 we present the generalization of Russo's 
formula. The reduction to identifying only one relevant variable is shown in Section 4. In Section 5, 
we prove Theorem 3 that addresses learnability via the s-th level Fourier algorithm from several oracles. 
Section 6 shows that the biases do not have to be known in advance. Finally, we prove Theorem 4 in 
Section 7.1 and propose open problems in Section 7.2. 

2 Preliminaries 

2.1 General Notation, Juntas, and Probability Theory 

Let $\mathbb{N} = \{0, 1, 2, \ldots\}$, and for $n \in \mathbb{N}$, let $[n] = \{1, \ldots, n\}$. We use boldface letters such as $\mathbf{r}$, $\mathbf{x}$, and $\mathbf{a}$
to denote (real) vectors of length $n$. The corresponding entries are denoted by $r_i$, $x_i$, $a_i$, and so forth. For
$\mathbf{x} \in \{-1,+1\}^n$ and $i \in [n]$, denote by $\mathbf{x}^{(i)}$ the vector $\mathbf{x}$ with the sign of the $i$-th entry flipped.




Definition 1 (Relevant variables). Let $f : \{-1,1\}^n \to \{-1,1\}$. For $i \in [n]$, the function $f$ depends on
variable $x_i$ (equivalently, $x_i$ is relevant to $f$) if there exists an $\mathbf{x} \in \{-1,1\}^n$ such that $f(\mathbf{x}^{(i)}) \ne f(\mathbf{x})$.

Definition 2 (Junta). Let $f : \{-1,1\}^n \to \{-1,1\}$ and $k \in [n]$. The function $f$ is a $k$-junta if it depends on
at most $k$ variables.

Let $x_1, \ldots, x_n$ be independent random variables taking values $-1$ and $+1$ with $\mathrm{E}[x_i] = r_i \in [-1,1]$.
The value $r_i$ is called the bias of $x_i$. Equivalently, $\Pr[x_i = -1] = (1 - r_i)/2$ and $\Pr[x_i = 1] = (1 + r_i)/2$.
In this way, $\{-1,1\}^n$ is equipped with the product measure $\mu_{\mathbf{r}}$, $\mathbf{r} = (r_1, \ldots, r_n)$, given by

$$\mu_{\mathbf{r}}(\mathbf{x}) = \prod_{i=1}^{n} \frac{1 + r_i x_i}{2}$$

for $\mathbf{x} \in \{-1,1\}^n$. For $f : \{-1,1\}^n \to \mathbb{R}$, we denote by $\mathrm{E}_{\mathbf{r}}[f]$ the expectation of $f$ with respect to $\mu_{\mathbf{r}}$.
Furthermore, for $f, g : \{-1,1\}^n \to \mathbb{R}$, let

$$\mathrm{Cov}_{\mathbf{r}}[f, g] = \mathrm{E}_{\mathbf{r}}\big[(f - \mathrm{E}_{\mathbf{r}}[f])(g - \mathrm{E}_{\mathbf{r}}[g])\big] = \mathrm{E}_{\mathbf{r}}[f \cdot g] - \mathrm{E}_{\mathbf{r}}[f] \cdot \mathrm{E}_{\mathbf{r}}[g]$$

denote the covariance of $f$ and $g$ with respect to $\mu_{\mathbf{r}}$. Denote by $\sigma_i = (1 - r_i^2)^{1/2}$ the standard deviation
of $x_i$ and let $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n)$. We will mostly be interested in the case that all biases $r_i$ are equal. For
$r \in [-1,1]$, let $\bar r = (r, \ldots, r)$ be the vector that consists of $n$ entries that are all equal to $r$. In this case, we
write $\sigma = \sigma(r) = \sqrt{1 - r^2}$. We will frequently use that if $|r| \le 1 - \alpha$ for some $\alpha > 0$, then $\sigma \ge \sqrt{\alpha}$. The
measure $\mu_{\bar r}$ is called the $r$-biased product distribution. We also write $\mu_r$ instead of $\mu_{\bar r}$, $\mathrm{E}_r$ instead of $\mathrm{E}_{\bar r}$, etc.

2.2 Learning Theory 

We introduce an extension of the classical PAC model [27]. Let $\mathcal{C} = \bigcup_{n \in \mathbb{N}} \mathcal{C}_n$ be a class of functions,
where each $\mathcal{C}_n$ contains some functions $f : \{-1,1\}^n \to \{-1,1\}$, and let $\mathcal{M} = \bigcup_{n \in \mathbb{N}} \mathcal{M}_n$ be a class of
input distributions, where each $\mathcal{M}_n$ contains distributions on $\{-1,1\}^n$. For $f \in \mathcal{C}_n$ and a distribution
$\mu \in \mathcal{M}_n$, denote by $EX(f, \mu)$ an oracle that on request generates $\mathbf{x} \in \{-1,1\}^n$ according to $\mu$ and returns
the example $(\mathbf{x}, f(\mathbf{x}))$. For $r \in [-1,1]$, we call $EX(f, \mu_r)$ an $r$-biased oracle. Let us first review the
original PAC model. The class $\mathcal{C}$ is PAC-learnable under distributions $\mathcal{M}$ if there is an algorithm $A$ that
for all $n \in \mathbb{N}$, all functions $f \in \mathcal{C}_n$, and all distributions $\mu \in \mathcal{M}_n$ on $\{-1,1\}^n$, given $\delta, \varepsilon > 0$ and access
to $EX(f, \mu)$ but no further knowledge of $f$ and $\mu$, outputs a hypothesis $h : \{-1,1\}^n \to \{-1,1\}$ such that
with probability at least $1 - \delta$ (taken over all random draws of the oracle), $\Pr_{\mathbf{x} \sim \mu}[h(\mathbf{x}) \ne f(\mathbf{x})] \le \varepsilon$. If
$\mathcal{M}_n$ is the class of all distributions on $\{-1,1\}^n$, we say that $\mathcal{C}$ is distribution-free PAC-learnable. If $\mathcal{M}_n$
only contains the uniform distribution on $\{-1,1\}^n$, we say that $\mathcal{C}$ is uniform distribution PAC-learnable. If
$\mathcal{M}_n$ is the class of all $r$-biased product distributions $\mu_r$ on $\{-1,1\}^n$, we say that $\mathcal{C}$ is learnable from biased
distributions. If $A$ even manages to output exactly $f$ (i.e., $\varepsilon = 0$), we say that $\mathcal{C}$ is exactly learnable.

The performance of a learning algorithm is measured by the number of examples it requests and by its
running time, both of course depending on $\delta$, $\varepsilon$, $n$, and possibly further parameters involved in the definition
of the class $\mathcal{C}$.

Now we study what happens if, instead of having access to a single oracle $EX(f, \mu)$, we allow the
learning algorithm to have access to multiple (pairwise different) oracles $EX(f, \mu_i)$, $\mu_i \in \mathcal{M}_n$ for $i \in [t]$. If
we do not impose any restrictions other than being pairwise different on the distributions $\mu_i$, then the learner
does not gain any power since the distributions could be arbitrarily close to each other. Thus, we allow the
running time to depend on the minimum distance $\gamma$ between pairs of distributions (at this point, we leave
open the choice of appropriate distance measures).

The notion of learnability is the same as above, except that we require that the hypothesis output by a
learning algorithm has to satisfy with probability at least $1 - \delta$ that $\Pr_{\mathbf{x} \sim \mu_i}[h(\mathbf{x}) \ne f(\mathbf{x})] \le \varepsilon$ for all $i \in [t]$.
In this case, we say that $\mathcal{C}$ is PAC-learnable from $t$ oracles under distributions $\mathcal{M}$ with separation $\gamma$.

In the following, we explain which variants of this very general new learning model we are interested in.
Our goal is to find efficient learning algorithms for the class of $k$-juntas. More precisely, for a non-decreasing
function $k : \mathbb{N} \to \mathbb{N}$, we want to learn the class $\mathcal{C} = \bigcup_{n \in \mathbb{N}} \mathcal{C}_n$, where $\mathcal{C}_n$ consists of all $k(n)$-juntas
$f : \{-1,1\}^n \to \{-1,1\}$. The fastest known (exact) learning algorithm for $\mathcal{C}$ in the uniform distribution
PAC-learning model runs in time $n^{0.7k}\,\mathrm{poly}(n, 2^k, \log(1/\delta))$ [23]. Moreover, for $k \in \omega(1)$, there is no
explicit distribution $\mu$ for which $\mathcal{C}$ is known to be PAC-learnable under $\mu$ in time $t(k) \cdot \mathrm{poly}(n, \log(1/\delta))$
with an arbitrary function $t : \mathbb{N} \to \mathbb{N}$. It thus seems reasonable to ask if we can do any better if we are
given access to more than one oracle with several simple distributions (possibly known to the learner). We
will show that this is in fact the case if the distributions are biased product distributions $\mu_{r_i}$ with
well-separated biases $r_i$, even without prior knowledge of the biases (except that each $|r_i|$ should be bounded
away from 1). Consequently, we manage to learn efficiently in the model of PAC-learning from multiple
biased product distributions. The separation of biases will be reflected in the dependence of the running
time on $\gamma = \min_{i \ne j} |r_i - r_j|$.

2.3 Fourier Coefficients 

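For $\mathbf{t} \in \mathbb{R}^n$ and $S \subseteq [n]$, define $t_S = \prod_{i \in S} t_i$. In particular, for $\mathbf{x} \in \{-1,1\}^n$, $x_S$ is the parity of the bits in
$\mathbf{x}$ indexed by $S$, and for $\mathbf{r} \in [-1,1]^n$, $\mathrm{E}_{\mathbf{r}}[x_S] = \prod_{i \in S} \mathrm{E}_{\mathbf{r}}[x_i] = r_S$. For $i \in [n]$ and $\mathbf{r} \in (-1,1)^n$, define
$\chi_i(\mathbf{x}, \mathbf{r}) = (x_i - r_i)/\sigma_i$ and for $S \subseteq [n]$, let $\chi_S(\mathbf{x}, \mathbf{r}) = \prod_{i \in S} \chi_i(\mathbf{x}, \mathbf{r})$.
The measure $\mu_{\mathbf{r}}$ induces the inner product

$$\langle f, g \rangle_{\mathbf{r}} = \mathrm{E}_{\mathbf{r}}[f \cdot g] = \sum_{\mathbf{x} \in \{-1,1\}^n} \mu_{\mathbf{r}}(\mathbf{x})\, f(\mathbf{x})\, g(\mathbf{x})$$

on $\mathbb{R}^{\{-1,1\}^n}$. The associated norm is

$$\|f\|_{2,\mathbf{r}} = \langle f, f \rangle_{\mathbf{r}}^{1/2} = \mathrm{E}_{\mathbf{r}}[f^2]^{1/2}.$$

The functions $\chi_S = \chi_S(\cdot, \mathbf{r})$, $S \subseteq [n]$, form an orthonormal basis of this space with respect to $\langle \cdot, \cdot \rangle_{\mathbf{r}}$:

$$\langle \chi_S, \chi_S \rangle_{\mathbf{r}} = \mathrm{E}_{\mathbf{r}}[\chi_S^2] = \prod_{i \in S} \frac{\mathrm{E}_{\mathbf{r}}[(x_i - r_i)^2]}{\sigma_i^2} = 1,$$

and if $i \in S \setminus T$ for some sets $S, T \subseteq [n]$, then $\mathrm{E}_{\mathbf{r}}[\chi_S \chi_T] = \mathrm{E}_{\mathbf{r}}[\chi_i]\,\mathrm{E}_{\mathbf{r}}[\chi_{S \setminus \{i\}} \chi_T] = 0$ since $\mathrm{E}_{\mathbf{r}}[\chi_i] = 0$.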

We can expand any function $f : \{-1,1\}^n \to \mathbb{R}$ as a linear combination of the functions $\chi_S(\cdot, \mathbf{r})$, called
the Fourier expansion of $f$ with respect to $\mu_{\mathbf{r}}$:

$$f = \sum_{S \subseteq [n]} \langle f, \chi_S \rangle_{\mathbf{r}}\, \chi_S,$$

and we call $\hat f(S, \mathbf{r}) = \langle f, \chi_S \rangle_{\mathbf{r}}$ the Fourier coefficient of $f$ at $S$ with respect to $\mu_{\mathbf{r}}$. Note that $\chi_i(\cdot, \mathbf{r})$ is a
linear function in $x_i$ and thus $\chi_S(\cdot, \mathbf{r})$ is a multilinear polynomial in the variables $x_i$, $i \in S$ (of degree $|S|$).
Consequently, the Fourier expansion (with respect to any $\mathbf{r} \in (-1,1)^n$) provides a representation of $f$ as a
real multilinear polynomial of degree

$$\deg(f, \mathbf{r}) = \max\{k \in [n] \mid \exists S \subseteq [n] : |S| = k \wedge \hat f(S, \mathbf{r}) \ne 0\}.$$
Since this degree does not actually depend on $\mathbf{r}$ (there is exactly one polynomial representation of $f$), we let

$$\deg(f) = \deg(f, 0).$$
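
For small $n$, these definitions can be evaluated directly by summing over all $2^n$ inputs. The following is a minimal brute-force sketch; the function names (`mu`, `chi`, `fourier_coefficient`, `degree`) and the test function `maj3` are our own choices for illustration, and coordinates are indexed from 0 rather than 1.

```python
from itertools import combinations, product
from math import prod, sqrt

def mu(x, r):
    """Probability of x under the product distribution with bias vector r."""
    return prod((1 + ri * xi) / 2 for xi, ri in zip(x, r))

def chi(x, S, r):
    """Basis function chi_S(x, r) = prod_{i in S} (x_i - r_i) / sigma_i."""
    return prod((x[i] - r[i]) / sqrt(1 - r[i] ** 2) for i in S)

def fourier_coefficient(f, n, S, r):
    """Exact coefficient <f, chi_S>_r, summed over all 2^n inputs."""
    return sum(mu(x, r) * f(x) * chi(x, S, r) for x in product((-1, 1), repeat=n))

def degree(f, n, r, tol=1e-9):
    """Largest |S| with a nonzero coefficient (0 if only the empty set contributes)."""
    return max(
        (len(S) for k in range(n + 1) for S in combinations(range(n), k)
         if abs(fourier_coefficient(f, n, S, r)) > tol),
        default=0,
    )

def maj3(x):
    """Majority of three bits, used here only as a test function."""
    return 1 if sum(x) > 0 else -1

# The degree is the same under the uniform measure and under an arbitrary bias vector.
uniform_degree = degree(maj3, 3, (0.0, 0.0, 0.0))
biased_degree = degree(maj3, 3, (0.3, -0.2, 0.6))
```

The tolerance `tol` is needed only because the coefficients are computed in floating point.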

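If $S = \{i\}$ is a singleton set, we also write $\hat f(i, \mathbf{r})$ instead of $\hat f(\{i\}, \mathbf{r})$. Note that for $S \ne \emptyset$, $\hat f(S, \mathbf{r}) =
\mathrm{Cov}_{\mathbf{r}}[f, \chi_S(\cdot, \mathbf{r})]$ since $\mathrm{E}_{\mathbf{r}}[\chi_S(\cdot, \mathbf{r})] = 0$. Put in another way,

$$\sigma_S \cdot \hat f(S, \mathbf{r}) = \mathrm{Cov}_{\mathbf{r}}\Big[f, \prod_{i \in S} (x_i - r_i)\Big],$$

where $\sigma_S = \prod_{i \in S} \sigma_i$, which equals $\sigma^{|S|}$ when all biases are equal.

In case we consider the $r$-biased product measure for some $r \in (-1,1)$, we call $\hat f(S, r) = \hat f(S, \bar r)$ an
$r$-biased Fourier coefficient. In particular, $\hat f(\emptyset, r) = \langle f, 1 \rangle_r = \mathrm{E}_r[f]$ (again using $r$ as subscripts rather than $\bar r$).
For the uniform measure $\mu_0$ with $r = 0$, the Fourier expansion of $f$ directly results in the representation
of $f$ as a real multilinear polynomial in canonical form (i.e., a linear combination of monomials $x_S$) since
$\chi_S(\mathbf{x}, 0) = x_S$: $f(\mathbf{x}) = \sum_{S \subseteq [n]} \hat f(S, 0) \cdot x_S$. Since for $\mathbf{r} \in [-1,1]^n$, $\mathrm{E}_{\mathbf{r}}[x_S] = r_S$, we obtain

$$\mathrm{E}_{\mathbf{r}}[f] = \sum_{S \subseteq [n]} \hat f(S, 0)\, r_S, \qquad (5)$$

of which (3) is the special case $\mathbf{r} = \bar r$ for $r \in [-1,1]$.

The weight of the $i$-th Fourier level of $f$ with respect to $\mu_{\mathbf{r}}$ is defined to be

$$w_i(f, \mathbf{r}) = \sum_{S \subseteq [n] : |S| = i} \hat f(S, \mathbf{r}).$$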



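Lemma 1. Let $f : \{-1,1\}^n \to \{-1,1\}$. If $\sum_{i=1}^{n} w_i(f, 0) = 0$, then $f$ is constant.

Proof. We have that

$$f(1, \ldots, 1) = \sum_{S \subseteq [n]} \hat f(S, 0) = \sum_{i=0}^{n} w_i(f, 0)$$

is either $1$ or $-1$. Thus, if $\sum_{i=1}^{n} w_i(f, 0) = 0$, then $|\hat f(\emptyset, 0)| = |w_0(f, 0)| = 1$, i.e., $f \equiv 1$ or $f \equiv -1$. □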

The connection between juntas and Fourier coefficients is given by the following characterization of 
relevant variables: 

Lemma 2 ([2, 23]). Let $f : \{-1,1\}^n \to \{-1,1\}$, $r \in (-1,1)$, and $i \in [n]$. Then $x_i$ is relevant to $f$ if and
only if there exists $S \subseteq [n]$ with $i \in S$ and $\hat f(S, r) \ne 0$.

In particular, if $\hat f(S, r) \ne 0$ for some $S \subseteq [n]$ and some $r \in (-1,1)$, then all variables $x_i$, $i \in S$,
are relevant to $f$. Thus, one way to find relevant variables is to look for non-vanishing Fourier coefficients.
Furthermore, if $f$ is a $k$-junta, then $\hat f(S, r) = 0$ for all $S$ with $|S| > k$, i.e., looking at coefficients up to
level $k$ is sufficient for finding all relevant variables.
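
The search for a non-vanishing coefficient over several biases can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's algorithm: coefficients are computed exactly by brute force (feasible only for small $n$) rather than estimated from oracle samples, and the names `coeff`, `find_relevant_set`, and `parity01` are hypothetical.

```python
from itertools import combinations, product
from math import prod, sqrt

def coeff(f, n, S, r):
    """Exact r-biased Fourier coefficient of f at S (all biases equal to r)."""
    sigma = sqrt(1 - r * r)
    total = 0.0
    for x in product((-1, 1), repeat=n):
        weight = prod((1 + r * xi) / 2 for xi in x)  # mu_r(x)
        total += weight * f(x) * prod((x[i] - r) / sigma for i in S)
    return total

def find_relevant_set(f, n, biases, s, tol=1e-9):
    """Return (S, r) with 1 <= |S| <= s and a nonzero r-biased coefficient, or None."""
    for r in biases:
        for size in range(1, s + 1):
            for S in combinations(range(n), size):
                if abs(coeff(f, n, S, r)) > tol:
                    return S, r  # by Lemma 2, every variable in S is relevant
    return None

def parity01(x):
    """Parity of the first two of n = 4 bits (0-indexed); a 2-junta of degree 2."""
    return x[0] * x[1]

# At level 1 the uniform distribution alone reveals nothing about this parity ...
uniform_only = find_relevant_set(parity01, 4, [0.0], s=1)
# ... but a second, nonzero bias exposes a relevant variable at level 1.
two_biases = find_relevant_set(parity01, 4, [0.0, 0.5], s=1)
```

With sampled estimates in place of exact values, the fixed tolerance would have to be replaced by a threshold tied to the estimation accuracy.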

2.4 Sampling Fourier Coefficients 

To approximate biased Fourier coefficients, we will make use of the Hoeffding bound [17]: 

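Fact 1 (Hoeffding bound, [17]). Let $X_i$, $i \in [m]$, be mutually independent random variables taking values
in $[a, b]$, $a < b$. Then for any $\varepsilon \in [0, 1]$,

$$\Pr\left[\,\left|\sum_{i=1}^{m} X_i - \mathrm{E}\left[\sum_{i=1}^{m} X_i\right]\right| \ge \varepsilon m\right] \le 2 \exp\left(\frac{-2 m \varepsilon^2}{(b - a)^2}\right).$$
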



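Lemma 3. Let $f : \{-1,1\}^n \to \{-1,1\}$, $r \in (-1,1)$, $S \subseteq [n]$, and $\delta > 0$. Given access to $EX(f, r)$, we
can estimate $\hat f(S, r)$ within accuracy $\varepsilon > 0$ from $m = \mathrm{poly}(2^{|S|}, (1/\sigma)^{|S|}, \log(1/\delta), 1/\varepsilon)$ examples in time
$O(m \cdot n)$ with confidence $1 - \delta$, provided that $r$ is given exactly.

Proof. Draw $m = 2 \cdot \ln(2/\delta) \cdot (2^{|S|}/\varepsilon)^2 \cdot (1/\sigma)^{2|S|}$ examples $(\mathbf{x}^j, f(\mathbf{x}^j))$ from $EX(f, r)$. Define $A =
(\max_{x_i \in \{-1,1\}} |x_i - r|)^{|S|} = (1 + |r|)^{|S|} \le 2^{|S|}$. Let $g(\mathbf{x}^j) = \sigma^{|S|} f(\mathbf{x}^j)\, \chi_S(\mathbf{x}^j, r) \in [-A, A]$. Then, by
Fact 1,

$$\left|\frac{1}{m} \sum_{j=1}^{m} g(\mathbf{x}^j) - \sigma^{|S|} \hat f(S, r)\right| \le \varepsilon\, \sigma^{|S|}$$

with probability at least $1 - \delta$. □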
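
The estimator from this proof can be sketched as follows, assuming the sample size $m = 2 \ln(2/\delta)\,(2^{|S|}/\varepsilon)^2\,(1/\sigma)^{2|S|}$ and simulating the oracle locally. The function names, the seed, and the test function are our own assumptions for illustration.

```python
import random
from math import ceil, log, prod, sqrt

def sample_size(size_S, r, eps, delta):
    """Sample count m = 2 ln(2/delta) (2^|S| / eps)^2 (1/sigma)^(2|S|), rounded up."""
    sigma_sq = 1 - r * r
    return ceil(2 * log(2 / delta) * (2 ** size_S / eps) ** 2 / sigma_sq ** size_S)

def estimate_coefficient(f, n, S, r, eps, delta, rng):
    """Empirical mean of f(x) * chi_S(x, r) over m simulated r-biased examples."""
    sigma = sqrt(1 - r * r)
    m = sample_size(len(S), r, eps, delta)
    total = 0.0
    for _ in range(m):
        # Simulated oracle draw: each bit is +1 with probability (1 + r) / 2.
        x = [1 if rng.random() < (1 + r) / 2 else -1 for _ in range(n)]
        total += f(x) * prod((x[i] - r) / sigma for i in S)
    return total / m

def parity01(x):
    """Parity of the first two bits (0-indexed), used here only as a test function."""
    return x[0] * x[1]

r = 0.5
estimate = estimate_coefficient(parity01, 4, (0,), r, eps=0.1, delta=0.05,
                                rng=random.Random(1))
exact = sqrt(1 - r * r) * r  # closed form sigma * r for this particular parity
```

The Hoeffding-based sample size is conservative, so in practice the estimate is typically much closer to the exact value than the guaranteed accuracy $\varepsilon$.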

We will deal with the case that $r$ is not exactly given in advance in Section 6. To distinguish the cases
$\hat f(S, r) = 0$ and $\hat f(S, r) \ne 0$, we also need that a non-vanishing $\hat f(S, r)$ is not too small. For this, we will
use the following (straightforward) lemma:

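Lemma 4. Let $h \in \mathbb{R}[x]$ be a polynomial of degree $d$ with leading coefficient $b$ and roots $t_1, \ldots, t_d \in \mathbb{C}$.
Let $t \in \mathbb{R}$ and $\varepsilon > 0$ such that $|t - \mathrm{Re}\, t_i| \ge \varepsilon$ for all $i \in [d]$. Then $|h(t)| \ge |b| \cdot \varepsilon^d$.

Proof. Since $h(x) = b \cdot \prod_{i \in [d]} (x - t_i)$, we have $|h(t)| = |b| \cdot \prod_{i \in [d]} |t - t_i| \ge |b| \cdot \prod_{i \in [d]} |t - \mathrm{Re}\, t_i| \ge |b| \cdot \varepsilon^d$. □

2.5 Derivatives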

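For a $k$-fold differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ and $S = \{i_1, \ldots, i_k\} \subseteq [n]$ with pairwise different
elements $i_j$, denote by $\frac{\partial^k}{\partial x_S} f = \frac{\partial^k}{\partial x_{i_1} \cdots \partial x_{i_k}} f$ the $k$-th order partial derivative with respect to $x_{i_1}, \ldots, x_{i_k}$.

Lemma 5. Let $g \in \mathbb{R}[t_1, \ldots, t_n]$ be a multilinear polynomial (i.e., all exponents are at most one) and define
$h \in \mathbb{R}[t]$ by $h(t) = g(t, \ldots, t)$. Then

$$\frac{d^k}{dt^k}\, h(t) = k! \sum_{S \subseteq [n] : |S| = k} \frac{\partial^k}{\partial t_S}\, g(t, \ldots, t).$$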

Proof. The easy way to see the claim is to simply apply the chain rule. For multilinear polynomials,
though, we can as well check the claim "by hand": By linearity of the construction of $h$, it suffices to check
the claim for the case that $g$ is a monomial. Without loss of generality, assume that $g(t_1, \ldots, t_n) = t_1 \cdots t_\ell$.
Let $S \subseteq [n]$ with $|S| = k$. If $S \not\subseteq [\ell]$, then clearly $(\partial^k/\partial t_S)\, g = 0$, so such sets do not contribute to the sum.
If $S \subseteq [\ell]$, then $(\partial^k/\partial t_S)\, g(t, \ldots, t) = t^{\ell - k}$, so that

$$k! \sum_{S \subseteq [n] : |S| = k} \frac{\partial^k}{\partial t_S}\, g(t, \ldots, t) = k! \cdot \binom{\ell}{k}\, t^{\ell - k} = \frac{\ell!}{(\ell - k)!}\, t^{\ell - k}.$$

On the other hand, $h(t) = t^\ell$ and thus $(d^k/dt^k)\, h(t) = \ell \cdot (\ell - 1) \cdots (\ell - k + 1) \cdot t^{\ell - k} = \frac{\ell!}{(\ell - k)!}\, t^{\ell - k}$. □

3 An Extension of Russo's Formula to General Product Distributions and 
Higher Order Derivatives 

In this section, we derive our connection between derivatives of $\mathrm{E}_{\mathbf{r}}[f]$ and Fourier levels. In particular, we
prove Theorem 5 stated in Section 1.
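
Proposition 1. Let $f : \{-1,1\}^n \to \mathbb{R}$, $S \subseteq [n]$ with $|S| = k$, and $\mathbf{r}^* \in (-1,1)^n$. Then

$$\frac{\partial^k}{\partial r_S}\, \mathrm{E}_{\mathbf{r}}[f]\,\Big|_{\mathbf{r} = \mathbf{r}^*} = \prod_{i \in S} \big(1 - (r_i^*)^2\big)^{-1} \cdot \mathrm{Cov}_{\mathbf{r}^*}\Big[f, \prod_{i \in S} (x_i - r_i^*)\Big].$$

Proof. Expanding $f$ with respect to $\mu_{\mathbf{r}^*}$, we see that

$$\mathrm{E}_{\mathbf{r}}[f] = \sum_{S \subseteq [n]} \hat f(S, \mathbf{r}^*)\, \mathrm{E}_{\mathbf{r}}[\chi_S(\cdot, \mathbf{r}^*)] = \sum_{S \subseteq [n]} \hat f(S, \mathbf{r}^*) \prod_{i \in S} \frac{r_i - r_i^*}{\sigma_i^*},$$

where $\sigma_i^* = (1 - (r_i^*)^2)^{1/2}$, is simply the Taylor expansion of the multilinear polynomial $\mathrm{E}_{\mathbf{r}}[f]$ around $\mathbf{r}^*$.
Comparing coefficients, the partial derivative on the left-hand side equals $\hat f(S, \mathbf{r}^*) / \sigma_S^*$ with $\sigma_S^* = \prod_{i \in S} \sigma_i^*$, and the
claim follows since $\sigma_S^* \cdot \hat f(S, \mathbf{r}^*) = \mathrm{Cov}_{\mathbf{r}^*}[f, \prod_{i \in S} (x_i - r_i^*)]$ (see Section 2.3). □

Putting together Proposition 1 and Equation (5), we obtain the relationship

$$\hat f(S, \mathbf{r}) = \sigma_S \sum_{T \supseteq S} \hat f(T, 0)\, r_{T \setminus S}. \qquad (6)$$

Theorem 5 in the introduction now follows from Proposition 1 and Lemma 5, and (4) is a special case of (6).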
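
For equal biases, the relationship between biased and uniform coefficients reads $\hat f(S, r) = \sigma^{|S|} \sum_{T \supseteq S} \hat f(T, 0)\, r^{|T \setminus S|}$. The brute-force sketch below compares both sides on a small example; the helper names and the test function `maj3` are our own assumptions for illustration.

```python
from itertools import combinations, product
from math import prod, sqrt

def coeff(f, n, S, r):
    """Exact r-biased Fourier coefficient of f at S (all biases equal to r)."""
    sigma = sqrt(1 - r * r)
    total = 0.0
    for x in product((-1, 1), repeat=n):
        weight = prod((1 + r * xi) / 2 for xi in x)  # mu_r(x)
        total += weight * f(x) * prod((x[i] - r) / sigma for i in S)
    return total

def coeff_from_uniform(f, n, S, r):
    """sigma^|S| times the sum over supersets T of S of f^(T, 0) * r^(|T| - |S|)."""
    sigma = sqrt(1 - r * r)
    rest = [i for i in range(n) if i not in S]
    total = 0.0
    for extra in range(len(rest) + 1):
        for U in combinations(rest, extra):
            T = tuple(sorted(S + U))  # a superset of S with `extra` additional elements
            total += coeff(f, n, T, 0.0) * r ** extra
    return sigma ** len(S) * total

def maj3(x):
    """Majority of three bits, used here only as a test function."""
    return 1 if sum(x) > 0 else -1

r = -0.3
subsets = [S for k in range(4) for S in combinations(range(3), k)]
max_gap = max(abs(coeff(maj3, 3, S, r) - coeff_from_uniform(maj3, 3, S, r))
              for S in subsets)
```

Up to floating-point error, `max_gap` is zero, so on this example the uniform spectrum determines every biased coefficient.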



4 Identifying One Relevant Variable Is Enough 

In analogy to Proposition 6 in [23], we show that if we have an algorithm that identifies just one relevant
variable of a non-constant $k$-junta $f$ using $m = \mathrm{poly}(\log n, 2^k, (1/\alpha)^k, \log(1/\delta))$ examples from
$EX(f, r_1), \ldots, EX(f, r_t)$ (where $\alpha > 0$ bounds the biases away from $-1$ and $1$) in time $n^{\beta k}\,\mathrm{poly}(m, n)$,
then we can construct an algorithm that identifies all relevant variables and outputs the truth table of $f$ using
$m' = t \cdot \mathrm{poly}(\log n, 2^k, (1/\alpha)^k, \log(1/\delta))$ examples in time $n^{\beta k}\,\mathrm{poly}(m', n)$ (for the same $\beta$, but a different
polynomial):

Proposition 2. Let $A$ be an algorithm that, given access to $EX(f, r_1), \ldots, EX(f, r_t)$ for some non-constant
$k$-junta $f : \{-1,1\}^n \to \{-1,1\}$ and some $r_1, \ldots, r_t \in (-1 + \alpha, 1 - \alpha)$ ($\alpha > 0$) and given $\delta > 0$, outputs with
probability at least $1 - \delta$ one relevant variable of $f$ using $m = \mathrm{poly}(\log n, 2^k, (1/\alpha)^k, \log(1/\delta))$ examples
in time $n^{\beta k} \cdot \mathrm{poly}(m, n)$. Then there is an algorithm $B$ that, for any $k$-junta $f : \{-1,1\}^n \to \{-1,1\}$, given
access to $EX(f, r_1), \ldots, EX(f, r_t)$ and $\delta > 0$, outputs with probability at least $1 - \delta$ all relevant variables
and a truth table of $f$, using $m' = t \cdot \mathrm{poly}(\log n, 2^k, (1/\alpha)^k, \log(1/\delta))$ examples in time $n^{\beta k}\,\mathrm{poly}(m', n)$.

Proof. The proposition can be proved by an adaptation of the proof of the proposition in [23] cited above, so we only point to the necessary modifications of the latter. First, if $f$ is non-constant, then each output value $f(x)$ is drawn from $\mathrm{EX}(f, r_i)$ with frequency at least $(\min\{(1 - r_i)/2, (1 + r_i)/2\})^k \ge (\alpha/2)^k$. Thus, the check for constancy with confidence $\delta$ requires $O((2/\alpha)^k \log(1/\delta))$ examples and $\mathrm{poly}((2/\alpha)^k, n, \log(1/\delta))$ steps.

Next, for restrictions $f|_\rho$ of $f$ fixing at most $k$ variables, each simulation of a draw from $\mathrm{EX}(f|_\rho, r_i)$ requires the draw of $O((2/\alpha)^k \log(m/\delta))$ examples from $\mathrm{EX}(f, r_i)$.

Since $A$ is run at most $k 2^k$ times with confidence $1 - \delta/(k 2^k)$ each, it suffices to draw

$$O\big(m\, (2/\alpha)^k \log(m k 2^k/\delta)\big) = m \log(m/\delta) \cdot \mathrm{poly}(2^k, (1/\alpha)^k)$$

examples from each oracle $\mathrm{EX}(f, r_i)$ (note that $A$ run on different restrictions may ask $m$ examples from different oracles).

Finally, to read off a truth table of $f$ from the examples, $\mathrm{poly}((2/\alpha)^k, \log(1/\delta))$ examples (from any of the oracles) are again sufficient to ensure with probability $1 - \delta$ that every possible assignment of the relevant variables appears in the examples. The claim follows since $m = \mathrm{poly}(\log n, 2^k, (1/\alpha)^k, \log(1/\delta))$. □
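
The simulation of draws for a restriction $f|_\rho$ mentioned in the proof can be pictured as rejection sampling. The sketch below is illustrative only: the function names, the dictionary encoding of $\rho$, and the example junta are assumptions, and the proof in [23] may organize this step differently.

```python
import random

def draw_example(f, n, r, rng):
    """One labeled example (x, f(x)) with independent coordinates of mean r."""
    x = tuple(1 if rng.random() < (1 + r) / 2 else -1 for _ in range(n))
    return x, f(x)

def draw_restricted(f, n, r, rho, rng, max_tries=100_000):
    """Simulate one draw for the restriction f|rho by rejection sampling.

    rho maps a coordinate index to its fixed value in {-1, +1}.
    When |r| <= 1 - alpha, each attempt is consistent with rho with
    probability at least (alpha / 2) ** len(rho).
    """
    for _ in range(max_tries):
        x, y = draw_example(f, n, r, rng)
        if all(x[i] == value for i, value in rho.items()):
            return x, y
    raise RuntimeError("no example consistent with the restriction was drawn")

rng = random.Random(0)
parity = lambda x: x[0] * x[1]   # hypothetical 2-junta on n = 5 variables
rho = {0: 1, 3: -1}              # fix x_0 = +1 and x_3 = -1
restricted_samples = [draw_restricted(parity, 5, 0.2, rho, rng) for _ in range(200)]
```

Every returned example agrees with $\rho$ on the fixed coordinates, so its label is a value of the restricted function; the expected number of raw draws per accepted example is at most $(2/\alpha)^{|\rho|}$.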

5 Learning Relevant Variables via the s-th Level Fourier Algorithm



The goal of this section is to prove Theorem 3. For $s \in [n]$, let

$$\mathcal{R}_s(f) = \left\{ r_0 \in (-1, 1) \;\middle|\; r_0 = \mathrm{Re}(r) \text{ for some root } r \in \mathbb{C} \text{ of } \tfrac{d}{dr} E_r[f] \text{ of multiplicity at least } s \right\}.$$

By Theorem 5, $\mathcal{R}_s(f)$ contains all $r \in (-1, 1)$ such that $w_1(f, r) = \ldots = w_s(f, r) = 0$ and in particular all $r \in (-1, 1)$ for which $\hat f(S, r) = 0$ for all $S \subseteq [n]$ of size $1 \le |S| \le s$.

Lemma 6. Let $f : \{-1, 1\}^n \to \{-1, +1\}$ be a non-constant $k$-junta, $s \in [k]$, and $r \in (-1, 1)$ such that $\operatorname{dist}(r, \mathcal{R}_s(f)) \ge \gamma > 0$. Then there exists $S \subseteq [n]$ with $1 \le |S| \le s$ such that $|\hat f(S, r)| \ge \sigma^s (\gamma/4)^k$. In particular, all variables $x_i$ with $i \in S$ are relevant.

Proof. Let $r_0 = r$. Let $g(r) = E_r[f]$. By (3) and Lemma 1, $g$ is a non-constant polynomial of degree $d = \deg(g) \le \deg(f) \le k$ with leading coefficient $w_d(f, 0)$. Let $t \ge 1$ be minimal with $(d^t/dr^t)\, g|_{r = r_0} \ne 0$. Since $r_0 \notin \mathcal{R}_s(f)$, $t \le s$. Let $h = (d^t/dr^t)\, g$. Then $h$ is a non-zero polynomial of degree $d - t \le \deg(f) - t \le k - t$. The highest coefficient of $h$ is $b = d \cdot (d - 1) \cdots (d - t + 1) \cdot w_d(f, 0)$. By Lemma 4, $|h(r_0)| \ge |b| \cdot \gamma^{d - t}$. Since $w_d(f, 0)$ is a non-zero integer multiple of $2^{-k}$, $|b| \ge t!\, 2^{-k}$. By Theorem 5, $h(r_0) = t!\, \sigma^{-t} w_t(f, r_0)$, so that

$$|w_t(f, r_0)| = (t!)^{-1} \sigma^t |h(r_0)| \ge \sigma^t\, 2^{-k} \gamma^{d - t}.$$

Hence there exists $S \subseteq [n]$ with $|S| = t$ such that

$$|\hat f(S, r_0)| \ge \binom{k}{t}^{-1} \sigma^t\, 2^{-k} \gamma^{d - t} \ge \binom{k}{t}^{-1} (\gamma/2)^k \sigma^t \ge (\gamma/4)^k \sigma^s.$$

□

For the remainder of this section, we assume that a learning algorithm has exact knowledge of all biases. 
However, we will show in Section 6 that this assumption is not necessary. 

Proposition 3. There is an algorithm that, given access to the oracle $\mathrm{EX}(f, r)$ and any $\delta > 0$, outputs at least one relevant variable of $f$ with probability at least $1 - \delta$ whenever $f : \{-1, 1\}^n \to \{-1, +1\}$ is a non-constant $k$-junta and $r \in (-1 + \alpha, 1 - \alpha)$ (for some $\alpha > 0$) is such that $\operatorname{dist}(r, \mathcal{R}_s(f)) \ge \gamma$ for some $\gamma > 0$. The algorithm uses $m = \mathrm{poly}(\log n, 2^k, (1/\gamma)^k, (1/\alpha)^s, \log(1/\delta))$ examples and runs in time $n^s \cdot \mathrm{poly}(m, n)$. Furthermore, for arbitrary $r \in [-1 + \alpha, 1 - \alpha]$, with probability at least $1 - \delta$, any variable output by the algorithm is relevant.

Proof. By Lemma 6, there exists $S \subseteq [n]$ with $1 \le |S| \le s$ such that $|\hat f(S, r)| \ge \sigma^s (\gamma/4)^k \ge \alpha^{s/2} (\gamma/4)^k$. Thus, it suffices to estimate all coefficients $\hat f(S, r)$, $S \subseteq [n]$ with $1 \le |S| \le s$, within accuracy $\alpha^{s/2} (\gamma/4)^k / 2$, each with confidence $1 - \delta \cdot n^{-s}$, to identify (with probability at least $1 - \delta$) at least one $S$ such that $\hat f(S, r) \ne 0$. This takes $\mathrm{poly}(2^{|S|}, (1/\alpha)^{|S|}, \log(n^s/\delta), (4/\gamma)^k (1/\alpha)^s)$ examples from the oracle $\mathrm{EX}(f, r)$ by Lemma 3, and we can reuse the same examples to estimate all coefficients (since we use a union bound for the confidence). Overall, the number of examples used is

$$m = \mathrm{poly}(\log n, 2^k, (1/\gamma)^k, (1/\alpha)^s, \log(1/\delta)).$$

The algorithm outputs all variables $x_i$ for which it finds a nonzero Fourier coefficient $\hat f(S, r)$ with $i \in S$, that is, one whose estimate is at least $\alpha^{s/2} (\gamma/4)^k / 2$ in absolute value. Since we have to check $\sum_{i=1}^{s} \binom{n}{i} = O(n^s)$ coefficients in the worst case, the running time is bounded above by

$$n^s \cdot \mathrm{poly}(m, n).$$

For the second part of the claim, note that if $\hat f(S, r) = 0$ (and especially, if $S$ contains an index $i$ of some non-relevant variable), then the estimate for $|\hat f(S, r)|$ will with high probability be smaller than $\alpha^{s/2} (\gamma/4)^k / 2$. □
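
The procedure in the proof of Proposition 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bias `r` is known, the function names are hypothetical, and `threshold` is passed in directly rather than set to the much smaller value $\alpha^{s/2} (\gamma/4)^k / 2$ used in the analysis.

```python
import random
from itertools import combinations
from math import prod, sqrt

def draw_samples(f, n, r, m, rng):
    """m labeled examples (x, f(x)) with independent coordinates of mean r."""
    samples = []
    for _ in range(m):
        x = tuple(1 if rng.random() < (1 + r) / 2 else -1 for _ in range(n))
        samples.append((x, f(x)))
    return samples

def estimate_coefficient(samples, S, r):
    """Empirical mean of f(x) * chi_S(x, r) over the labeled samples."""
    sigma = sqrt(1 - r * r)
    return sum(y * prod((x[i] - r) / sigma for i in S) for x, y in samples) / len(samples)

def level_s_relevant_variables(samples, n, s, r, threshold):
    """Indices lying in some S with 1 <= |S| <= s whose estimate reaches the threshold."""
    found = set()
    for size in range(1, s + 1):
        for S in combinations(range(n), size):
            if abs(estimate_coefficient(samples, S, r)) >= threshold:
                found.update(S)
    return found

rng = random.Random(1)
# Hypothetical target: parity of x_0 and x_1 hidden among n = 6 variables.
samples = draw_samples(lambda x: x[0] * x[1], n=6, r=0.5, m=4000, rng=rng)
found = level_s_relevant_variables(samples, n=6, s=1, r=0.5, threshold=0.2)
```

For this target and bias $r = 0.5$, the level-one coefficients of the two relevant variables equal $r\sigma \approx 0.433$ while those of the irrelevant variables are zero, so a level-one scan already separates them. The same samples are reused for every set $S$, as in the proof.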



Theorem 6. Let $s, t \in [k]$ such that $s \cdot t \ge k$, $\alpha, \gamma > 0$, and $-1 + \alpha < r_1 < \ldots < r_t < 1 - \alpha$ with $r_{j+1} - r_j \ge \gamma$ for all $j \in [t - 1]$. Then there is an algorithm that, for any non-constant $k$-junta $f : \{-1, 1\}^n \to \{-1, +1\}$, given $\delta > 0$ and having access to the oracles $\mathrm{EX}(f, r_1), \ldots, \mathrm{EX}(f, r_t)$, outputs a relevant variable of $f$ with probability at least $1 - \delta$, using $m = \mathrm{poly}(\log n, 2^k, (1/\gamma)^k, (1/\alpha)^s, \log(1/\delta))$ examples and running in time $n^s \cdot \mathrm{poly}(m, n)$.

Proof. Let $h(r) = w_1(f, r)/\sigma = (d/dr) E_r[f]$. Since $h$ is a nonzero polynomial of degree at most $\deg(f) - 1 \le k - 1$ and since $s \cdot t \ge k$, $h$ has fewer than $t$ roots of multiplicity at least $s$. Consequently, there exists $j \in [t]$ such that $\operatorname{dist}(r_j, \mathcal{R}_s(f)) \ge \gamma/2$. Running the algorithm from Proposition 3 for every single bias $r_j$, $j \in [t]$ (each time with confidence parameter $\delta/t$, reusing the same examples), yields the claim. □
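
A small exact computation illustrates why several biases help. The sketch below uses a hypothetical example, the parity of two variables with $s = 1$ and $t = 2$, and computes the largest level-one coefficient at each bias by brute force; the helper names are assumptions. For this function $(d/dr) E_r[f] = 2r$, so the only root is $r = 0$.

```python
from itertools import combinations, product
from math import prod, sqrt

def exact_coefficient(f, n, S, r):
    """Exact f^(S, r) = E_r[f(x) * chi_S(x, r)] for a scalar bias r."""
    sigma = sqrt(1 - r * r)
    total = 0.0
    for x in product((-1, 1), repeat=n):
        weight = prod((1 + r * xi) / 2 for xi in x)  # probability of x under bias r
        total += weight * f(x) * prod((x[i] - r) / sigma for i in S)
    return total

def largest_low_level_coefficient(f, n, s, r):
    """Largest |f^(S, r)| over all S with 1 <= |S| <= s."""
    return max(
        abs(exact_coefficient(f, n, S, r))
        for size in range(1, s + 1)
        for S in combinations(range(n), size)
    )

parity = lambda x: x[0] * x[1]   # hypothetical 2-junta on n = 3 variables
biases = [0.0, 0.5]              # t = 2 biases separated by gamma = 0.5
signal = {r: largest_low_level_coefficient(parity, 3, 1, r) for r in biases}
```

At the uniform bias the level-one signal vanishes, whereas at $r = 0.5$ it equals $r\sigma \approx 0.433$, so running the level-one scan on both oracles is guaranteed to succeed on at least one of them.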

Proof of Theorem 3. Theorem 6 shows that it is possible to identify at least one relevant variable from the claimed number of examples in time $n^s \cdot \mathrm{poly}(m, n)$. By Proposition 2, the claim follows. □

We note that since $h(r)$ is of degree at most $\deg(f) - 1$, it actually suffices to have $t$ oracles with $s \cdot t \ge d$ if we are given the promise that $\deg(f) \le d$.



6 Biases Unknown in Advance 



The algorithms provided in Section 5 require that all biases $r_i$ are precisely known to the learner. As one might expect, this assumption is not necessary since a learner can get good estimates of the biases from (unlabeled) random examples. The main technical issue is now to show that using good estimates $r_i'$ still leads to sufficiently close approximations of the Fourier coefficients with respect to the true biases $r_i$. For this it suffices to show that $\chi_S(\cdot, r_i)$ and $\chi_S(\cdot, r_i')$ are close in $L_2$.

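Lemma 7. Let $\alpha, \gamma > 0$, $r, r' \in (-1, 1)$ such that $|r| \le 1 - \alpha$ and $|r - r'| \le \gamma$, $S \subseteq [n]$. Then

$$\|\chi_S(\cdot, r) - \chi_S(\cdot, r')\|_{2,r} \le \frac{(|S| + 1)\, \gamma}{\alpha^{1/2}\, \sigma'^{|S|}}.$$

To prove Lemma 7, we will first compute, given $\mathbf{r}, \mathbf{r}' \in (-1, 1)^n$, the Fourier coefficients of $\chi_S(\cdot, \mathbf{r}')$ with respect to $\mu_{\mathbf{r}}$. Although we only need the case in which all coordinates of $\mathbf{r}$ equal $r$ and all coordinates of $\mathbf{r}'$ equal $r'$ for our applications, we state the result for general bias vectors $\mathbf{r}$ and $\mathbf{r}'$ since the proof does not simplify for the special case.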

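Lemma 8. Let $\mathbf{r}, \mathbf{r}' \in (-1, 1)^n$ and $S, T \subseteq [n]$. Then

$$\widehat{\chi_S(\cdot, \mathbf{r}')}(T, \mathbf{r}) = \langle \chi_S(\cdot, \mathbf{r}'), \chi_T(\cdot, \mathbf{r}) \rangle_{\mathbf{r}} = \begin{cases} 0 & \text{if } T \not\subseteq S, \\ \dfrac{\sigma^T}{\sigma'^S}\, (\mathbf{r} - \mathbf{r}')^{S \setminus T} & \text{if } T \subseteq S. \end{cases}$$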



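Proof. We have $E_{\mathbf{r}}[\chi_i(\cdot, \mathbf{r}')] = (r_i - r_i')/\sigma_i'$, $E_{\mathbf{r}}[\chi_i(\cdot, \mathbf{r})] = 0$, and $E_{\mathbf{r}}[\chi_i(\cdot, \mathbf{r}) \cdot \chi_i(\cdot, \mathbf{r}')] = \sigma_i/\sigma_i'$. The claim now follows from

$$\langle \chi_S(\cdot, \mathbf{r}'), \chi_T(\cdot, \mathbf{r}) \rangle_{\mathbf{r}} = E_{\mathbf{r}}[\chi_S(\cdot, \mathbf{r}') \cdot \chi_T(\cdot, \mathbf{r})] = \prod_{i \in S \setminus T} E_{\mathbf{r}}[\chi_i(\cdot, \mathbf{r}')] \cdot \prod_{i \in T \setminus S} E_{\mathbf{r}}[\chi_i(\cdot, \mathbf{r})] \cdot \prod_{i \in S \cap T} E_{\mathbf{r}}[\chi_i(\cdot, \mathbf{r}) \cdot \chi_i(\cdot, \mathbf{r}')].$$

□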
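
The formula of Lemma 8 lends itself to a brute-force check on a small cube. The following sketch is illustrative only; it assumes $\chi_S(x, \mathbf{r}) = \prod_{i \in S} (x_i - r_i)/\sigma_i$ and $\mu_{\mathbf{r}}(x) = \prod_i (1 + r_i x_i)/2$, and the function names and bias vectors are hypothetical.

```python
from itertools import combinations, product
from math import prod, sqrt

def chi(x, S, r):
    """Biased character chi_S(x, r) for a bias vector r."""
    return prod((x[i] - r[i]) / sqrt(1 - r[i] ** 2) for i in S)

def inner_product(S, T, r, r_prime):
    """<chi_S(., r'), chi_T(., r)>_r computed by enumerating {-1, 1}^n."""
    total = 0.0
    for x in product((-1, 1), repeat=len(r)):
        weight = prod((1 + ri * xi) / 2 for ri, xi in zip(r, x))  # mu_r(x)
        total += weight * chi(x, S, r_prime) * chi(x, T, r)
    return total

def lemma8_formula(S, T, r, r_prime):
    """Closed form from Lemma 8: zero unless T is a subset of S."""
    if not set(T) <= set(S):
        return 0.0
    sigma_T = prod(sqrt(1 - r[i] ** 2) for i in T)
    sigma_prime_S = prod(sqrt(1 - r_prime[i] ** 2) for i in S)
    shift = prod(r[i] - r_prime[i] for i in S if i not in T)
    return sigma_T / sigma_prime_S * shift

r = [0.2, -0.4, 0.6]           # arbitrary illustrative bias vectors
r_prime = [0.25, -0.3, 0.5]
all_sets = [S for size in range(4) for S in combinations(range(3), size)]
lemma8_gap = max(
    abs(inner_product(S, T, r, r_prime) - lemma8_formula(S, T, r, r_prime))
    for S in all_sets
    for T in all_sets
)
```

The quantity `lemma8_gap` compares both sides over all 64 pairs $(S, T)$ of subsets of a three-element ground set.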
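
Now we bound the $L_2$-norm of the difference between $\chi_S(\cdot, r)$ and $\chi_S(\cdot, r')$. Here we do restrict ourselves to scalar biases $r$ and $r'$ (that is, to bias vectors with all coordinates equal) to avoid an increase in technicality:

Proof of Lemma 7. By Parseval's equation,

$$\|\chi_S(\cdot, r) - \chi_S(\cdot, r')\|_{2,r}^2 = \sum_{T \subseteq [n]} \left( \widehat{\chi_S(\cdot, r)}(T, r) - \widehat{\chi_S(\cdot, r')}(T, r) \right)^2.$$

By Lemma 8 and since $\widehat{\chi_S(\cdot, r)}(T, r) = 0$ unless $T = S$, all summands for $T \not\subseteq S$ vanish. Furthermore, Lemma 8 states that for $T \subseteq S$, $\widehat{\chi_S(\cdot, r')}(T, r) = (\sigma^t/\sigma'^s)(r - r')^{s - t}$, where we let $s = |S|$ and $t = |T|$. Thus,

$$\|\chi_S(\cdot, r') - \chi_S(\cdot, r)\|_{2,r}^2 = \left( \widehat{\chi_S(\cdot, r')}(S, r) - 1 \right)^2 + \sum_{T \subsetneq S} \widehat{\chi_S(\cdot, r')}(T, r)^2 \le (\sigma')^{-2s} \left[ (\sigma^s - \sigma'^s)^2 + (\sigma^2 + \gamma^2)^s - \sigma^{2s} \right].$$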

Now we use the following two facts: 

Fact 2. For any $a, b \in [0, 1]$ with $|b - a| \le \rho$, $|a^s - b^s| \le s \cdot \rho$.

Proof. Let $a \le b$. Then by convexity of the function $x \mapsto x^s$, $b^s \le a^s + s b^{s-1}(b - a) \le a^s + s\rho$. □

Fact 3. If $|r' - r| \le \gamma$, then $|\sigma' - \sigma| \le \gamma/\sigma$.

Proof. Let $\sigma(r) = \sqrt{1 - r^2}$. The derivative of $\sigma$ is $(d/dr)\sigma(r) = -r/\sigma(r)$. Since $\sigma$ is concave, we have that for any $\delta$ such that $r, r + \delta \in (-1, 1)$, $\sigma(r + \delta) \le \sigma(r) + (d/dr)\sigma(r)\,\delta = \sigma(r) - r\delta/\sigma(r)$. Since $|r| < 1$, the claim follows with $r' = r + \delta$, $|\delta| \le \gamma$, $\sigma' = \sigma(r')$, and $\sigma = \sigma(r)$. □

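Let $\rho = \gamma \alpha^{-1/2}$. By Fact 3 and since $\sigma^2 = 1 - r^2 \ge 1 - |r| \ge \alpha$, $|\sigma' - \sigma| \le \rho$. From Fact 2, we obtain $|\sigma'^s - \sigma^s| \le s\rho$ and $(\sigma^2 + \gamma^2)^s - (\sigma^2)^s \le s\gamma^2$. Consequently,

$$\sigma'^{2s}\, \|\chi_S(\cdot, r') - \chi_S(\cdot, r)\|_{2,r}^2 \le (s\rho)^2 + s\gamma^2 = s^2\gamma^2/\alpha + s\gamma^2 \le (s + 1)^2 \gamma^2/\alpha.$$

This proves the lemma. □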
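
To get a feeling for how loose the bound of Lemma 7 is, one can compare it with the exact distance on small instances. The sketch below is an illustration under assumptions, not part of the proof: it enumerates $\{-1, 1\}^s$ for a set $S$ of size $s$, takes $\alpha = 1 - |r|$ and $\gamma = |r - r'|$ (the tightest admissible values), and uses hypothetical function names and test cases.

```python
from itertools import product
from math import prod, sqrt

def l2_distance(s, r, r_prime):
    """Exact ||chi_S(., r') - chi_S(., r)||_{2,r} for |S| = s, by enumeration."""
    sigma, sigma_prime = sqrt(1 - r * r), sqrt(1 - r_prime * r_prime)
    total = 0.0
    for x in product((-1, 1), repeat=s):
        weight = prod((1 + r * xi) / 2 for xi in x)  # probability of x under bias r
        diff = prod((xi - r_prime) / sigma_prime for xi in x) - prod((xi - r) / sigma for xi in x)
        total += weight * diff * diff
    return sqrt(total)

def lemma7_bound(s, r, r_prime):
    """(s + 1) * gamma / (alpha^(1/2) * sigma'^s) with the tightest alpha and gamma."""
    alpha = 1 - abs(r)          # largest alpha with |r| <= 1 - alpha
    gamma = abs(r - r_prime)    # smallest gamma with |r - r'| <= gamma
    sigma_prime = sqrt(1 - r_prime * r_prime)
    return (s + 1) * gamma / (sqrt(alpha) * sigma_prime ** s)

cases = [(1, 0.0, 0.1), (2, 0.5, 0.6), (3, 0.9, 0.95)]  # (s, r, r') triples
checks = [(l2_distance(*case), lemma7_bound(*case)) for case in cases]
```

In each of the three cases the exact distance stays below the bound, and the gap widens as the biases approach $\pm 1$.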

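As a corollary, we obtain an estimate of how well $\langle f, \chi_S(\cdot, r') \rangle_r$ approximates $\hat f(S, r)$:

Corollary 1. Let $f : \{-1, 1\}^n \to \{-1, 1\}$, $\gamma > 0$, $r, r' \in (-1, 1)$ such that $|r' - r| \le \gamma$, and $S \subseteq [n]$. Then

$$\left| \langle f, \chi_S(\cdot, r') \rangle_r - \hat f(S, r) \right| \le \frac{(|S| + 1)\, \gamma}{\alpha^{1/2}\, \sigma'^{|S|}},$$

where $\alpha$ is as in Lemma 7.

Proof. By Cauchy-Schwarz,

$$\left| \langle f, \chi_S(\cdot, r') \rangle_r - \hat f(S, r) \right| = \left| \langle f, \chi_S(\cdot, r') - \chi_S(\cdot, r) \rangle_r \right| \le \|f\|_{2,r}\, \|\chi_S(\cdot, r') - \chi_S(\cdot, r)\|_{2,r}.$$

The claim follows since $\|f\|_{2,r} = 1$. □

Next we show how to closely approximate $\hat f(S, r)$ given no a priori knowledge on $r$: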

Lemma 9. Let $f : \{-1, 1\}^n \to \{-1, 1\}$, $\alpha > 0$, $r \in [-1 + \alpha, 1 - \alpha]$, $S \subseteq [n]$, and $\delta > 0$. Given access to $\mathrm{EX}(f, r)$, we can estimate $\hat f(S, r)$ within accuracy $\varepsilon$ from $m = \mathrm{poly}(2^{|S|}, (1/\alpha)^{|S|}, \log(1/\delta), 1/\varepsilon)$ examples in time $O(m \cdot n)$ with confidence $1 - \delta$ without any a priori knowledge on $r$.

Proof. Let $\gamma = \alpha^{(|S|+1)/2} / (2(|S| + 1)) \le \alpha^{1/2} \sigma^{|S|} / (2(|S| + 1))$, so that, in particular, $\gamma \le \alpha/2 \le \sigma^2/2$
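(note that we may assume $|S| \ge 1$ without loss of generality). First, we approximate $r$ to within $\gamma$ by requesting $m_1 = 8\ln(4/\delta)/\gamma^2 = \mathrm{poly}(|S|^2, (1/\alpha)^{|S|}, \log(1/\delta))$ examples $(x^\ell, f(x^\ell))$ from $\mathrm{EX}(f, r)$ to compute $r' = (1/m_1) \sum_{\ell=1}^{m_1} x_1^\ell$. With probability at least $1 - \delta/2$, $|r' - r| \le \gamma$.

Now, letting $\theta(x) = \sigma'^{|S|} f(x)\, \chi_S(x, r')$, the average $\phi = (m_2\, \sigma'^{|S|})^{-1} \sum_{\ell=1}^{m_2} \theta(x^\ell)$ approximates $\langle f, \chi_S(\cdot, r') \rangle_r$ within accuracy $\varepsilon/2$ given $m_2 = \mathrm{poly}(2^{|S|}, (1/\sigma')^{2|S|}, \log(1/\delta), 1/\varepsilon)$ examples. Since

$$\sigma' \ge \sigma - \gamma/\sigma \ge \sigma/2 \ge \alpha^{1/2}/2$$

implies $(1/\sigma')^{2|S|} \le (4/\alpha)^{|S|}$, $m_2$ is dominated by $\mathrm{poly}(2^{|S|}, (1/\alpha)^{|S|}, \log(1/\delta), 1/\varepsilon)$. Finally,

$$|\phi - \hat f(S, r)| \le |\phi - \langle f, \chi_S(\cdot, r') \rangle_r| + |\langle f, \chi_S(\cdot, r') \rangle_r - \hat f(S, r)| \le \varepsilon/2 + (|S| + 1)\, \gamma\, \alpha^{-1/2} \sigma'^{-|S|} \le \varepsilon.$$

The total number of examples to be drawn is $\max\{m_1, m_2\}$, which is of the order indicated in the claim. □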
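
The two-stage estimator from the proof of Lemma 9 can be sketched as follows. This is a minimal illustration under assumptions: the function names and the example target are hypothetical, the sample size is chosen ad hoc rather than from the bounds $m_1$ and $m_2$, and the same batch of examples is reused for both stages.

```python
import random
from math import prod, sqrt

def draw_samples(f, n, r, m, rng):
    """m labeled examples (x, f(x)) with independent coordinates of mean r."""
    samples = []
    for _ in range(m):
        x = tuple(1 if rng.random() < (1 + r) / 2 else -1 for _ in range(n))
        samples.append((x, f(x)))
    return samples

def estimate_bias(samples):
    """Empirical mean of the first coordinate, used as the estimate r'."""
    return sum(x[0] for x, _ in samples) / len(samples)

def estimate_coefficient_unknown_bias(samples, S):
    """Estimate f^(S, r) without knowing r: plug the estimate r' into chi_S."""
    r_hat = estimate_bias(samples)
    sigma_hat = sqrt(1 - r_hat * r_hat)
    # Sum of f(x) * prod_{i in S} (x_i - r'), i.e. sigma'^|S| * f(x) * chi_S(x, r').
    theta_sum = sum(y * prod(x[i] - r_hat for i in S) for x, y in samples)
    return theta_sum / (len(samples) * sigma_hat ** len(S))

rng = random.Random(2)
true_bias = 0.5  # hidden from the estimator, used only to generate the data
samples = draw_samples(lambda x: x[0] * x[1], n=4, r=true_bias, m=20000, rng=rng)
estimate = estimate_coefficient_unknown_bias(samples, (0,))
```

For this target the true coefficient is $\hat f(\{0\}, r) = r\sigma \approx 0.433$, and the estimate should land close to it even though the estimator never sees `true_bias`.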

Using Lemma 9 in place of Lemma 3 shows that Proposition 3, Theorem 6, and finally also Theorems 2 and 3 even hold if the biases are not known in advance (except for the bound $|r_i| \le 1 - \alpha$).

7 Further Results and Open Problems 

7.1 Learning in Polynomial Time for All But Finitely Many Biases 

We have seen that for each $k$-junta $f$, there are at most $k - 1$ biases in $(-1, 1)$ for which $w_1(f, r) = 0$. Since for the $r$-biased product measure, $w_1(f, r)$ does not depend on where the relevant variables are hidden, it is not hard to see that there are at most $(k - 1) \cdot 2^{O(k)}$ biases for which there exists some $k$-junta $f$ (for any $n$) with $w_1(f, r) = 0$. Let us call these biases critical. Let $S_k$ denote the set of biases $r \in (-1, 1)$ such that there exist a function $t_r : \mathbb{N} \to \mathbb{N}$ and a $k$-junta-learning algorithm that learns from $\mathrm{EX}(f, r)$ in time $t_r(k) \cdot \mathrm{poly}(n)$. Then $S_k$ is exactly the complement of the critical points. This is because the minimum distance between any two distinct critical points is a function of $k$ only. This proves Theorem 4 stated in the introduction. Consequently, for each $k$, there are only finitely many biases for which junta-learning may not be feasible in time polynomial in $n$. The next step (left for future research) is to find lower bounds on $t_r(k)$.
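
As an illustration of a critical bias other than the uniform one, the sketch below locates a zero of $w_1(f, \cdot)$ for a small junta by bisection. It rests on assumptions: $w_1(f, r)$ is taken to be the sum of the level-one coefficients $\hat f(\{i\}, r)$, as its use in the proof of Lemma 6 suggests, and the function `gated_or` and the helper names are hypothetical.

```python
from itertools import product
from math import prod, sqrt

def level_one_weight(f, n, r):
    """w_1(f, r), taken here as the sum over i of f^({i}, r)."""
    sigma = sqrt(1 - r * r)
    total = 0.0
    for x in product((-1, 1), repeat=n):
        weight = prod((1 + r * xi) / 2 for xi in x)  # probability of x under bias r
        total += weight * f(x) * sum((xi - r) / sigma for xi in x)
    return total

def critical_bias(f, n, lo, hi, iterations=60):
    """Bisection for a zero of w_1(f, .) on [lo, hi]; assumes a sign change there."""
    w_lo = level_one_weight(f, n, lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        w_mid = level_one_weight(f, n, mid)
        if (w_mid > 0) == (w_lo > 0):
            lo, w_lo = mid, w_mid
        else:
            hi = mid
    return (lo + hi) / 2

def gated_or(x):
    """Hypothetical 3-junta: x[2] times the OR of x[0] and x[1] (+1 read as true)."""
    return x[2] * (1 if x[0] == 1 or x[1] == 1 else -1)

root = critical_bias(gated_or, 3, -0.9, 0.9)
```

For this function $E_r[f] = (r + 2r^2 - r^3)/2$, so $(d/dr) E_r[f] = (1 + 4r - 3r^2)/2$ vanishes inside $(-1, 1)$ only at $r = (2 - \sqrt{7})/3 \approx -0.2153$, which is the value the bisection approaches.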

Generalizing to arbitrary product distributions with bias vector $\mathbf{r} \in (-1, 1)^n$, we obtain that $w_1(f, \mathbf{r})$ is zero only for a set of biases of measure zero (since it is the zero set of a non-constant multi-linear polynomial). Considering the polynomials $\hat f(i, \mathbf{r})/\sigma_i$ separately for each $i \in [n]$, we recover the statement of [23] that $\hat f(i, \mathbf{r}) = 0$ for all $i \in [n]$ only for a set of measure zero.

7.2 Open Problems 

Next to the notoriously hard problem of designing more efficient algorithms for the junta learning problem under the uniform distribution, it would also constitute considerable progress to have, for any concretely given fixed bias $r \ne 0$, some algorithm improving over the $n^k$ bound. Note that we have shown in Section 7.1 that for all but finitely many $r$, the degree-one algorithm works. However, it is not clear how to decide in general whether a given bias is critical. We believe that the relationship (3) between Fourier coefficients with respect to different biases could be useful to this end.

In a different direction, it seems worthwhile to further study our newly introduced model of learning from multiple oracles. Can we show positive results for other learning problems that appear to be hard in the classical PAC setting? In particular, is there an efficient algorithm for learning DNFs or decision trees from multiple distributions? What general conditions on the distributions are required to make efficient learning possible? As the number of oracles obviously constitutes a significant resource parameter, it is natural to ask if polynomial time learning of juntas is also possible from $o(k)$ oracles (maybe at least for important subclasses).